Cornell College
STA 200 Fall 2025 Block 1
Scatterplots are useful for visualizing the relationship between two numerical variables.
Do life expectancy and total fertility appear to be associated or independent?
They appear to be linearly and negatively associated: as fertility increases, life expectancy decreases.
Was the relationship the same throughout the years, or did it change?
The relationship changed over the years.
Dot plots are useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.
How would you describe the distribution of GPAs in this data set?
Make sure to say something about the center, shape, and spread of the distribution.
The sample mean, denoted as \(\bar{x}\), can be calculated as:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
where \(x_1, x_2, \ldots, x_n\) represent the \(n\) observed values.
The population mean is also computed the same way but is denoted as \(\mu\). It is often not possible to calculate \(\mu\) since population data are rarely available.
The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
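As a rough illustration of the sample mean acting as a point estimate, the sketch below draws a simple random sample from a made-up population and compares \(\bar{x}\) with \(\mu\). The population values, the seed, and the sample size of 100 are all assumptions chosen for the example, not course data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 nightly sleep times in hours (made-up values).
population = [rng.gauss(7, 2) for _ in range(10_000)]
mu = statistics.mean(population)   # population mean, rarely computable in practice

# A simple random sample of n = 100 from that population.
sample = rng.sample(population, 100)
x_bar = sum(sample) / len(sample)  # sample mean, the point estimate of mu

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
```

The two values will usually be close but not identical, which is the sense in which the estimate "may not be perfect".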
In a histogram, higher bars represent areas where there are more observations, making it a little easier to judge the center and the shape of the distribution.
Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?
To determine modality, step back and imagine a smooth curve over the histogram: picture the bars as wooden blocks and drop a limp strand of spaghetti over them. The shape the spaghetti takes can be viewed as a smooth curve.
Is the histogram right skewed, left skewed, or symmetric?
Histograms are said to be skewed to the side of the long tail.
Are there any unusual observations or potential outliers?
How would you describe the shape of the distribution of hours per week students spend on extracurricular activities?
Unimodal and right skewed, with a potentially unusual observation at 60 hours/week.
unimodal
bimodal
multimodal
uniform
right skew
left skew
Which of these variables do you expect to be uniformly distributed?
Answer: birthdays of classmates
Sketch the expected distributions of the following variables:
Come up with a concise way (1-2 sentences) to teach someone how to determine the expected distribution of any variable.
http://www.youtube.com/watch?v=4B2xOvKFFz4
How useful are centers alone for conveying the true characteristics of a distribution?
Variance is roughly the average squared deviation from the mean.
\[ s^2 = \frac{\sum_{i = 1}^n (x_i - \bar{x})^2}{n - 1} \]
\[ s^2 = \frac{(5 - 6.71)^2 + (9 - 6.71)^2 + \cdots + (7 - 6.71)^2}{217 - 1} = 4.11~\text{hours}^2 \]
Why do we use the squared deviation in the calculation of variance?
The standard deviation is the square root of the variance, and has the same units as the data.
\[ s = \sqrt{s^2} \]
The standard deviation of amount of sleep students get per night can be calculated as:
\[
s = \sqrt{4.11} = 2.03~\text{hours}
\]
We can see that all of the data are within 3 standard deviations of the mean.
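The variance and standard deviation formulas above can be sketched in a few lines. The five values below are a hypothetical sample, not the 217 observations behind the slide, so the results differ from 4.11 and 2.03.

```python
import math
import statistics

# Hypothetical sample of nightly sleep hours (made-up values).
sample = [5, 9, 8, 6, 7]

n = len(sample)
x_bar = sum(sample) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Standard deviation: square root of the variance, back in the data's units.
s = math.sqrt(s2)

# The standard library uses the same n - 1 definitions.
assert math.isclose(s2, statistics.variance(sample))
assert math.isclose(s, statistics.stdev(sample))

# Check whether every observation lies within 3 standard deviations of the mean.
within_3_sd = all(abs(x - x_bar) <= 3 * s for x in sample)
print(s2, round(s, 2), within_3_sd)  # 2.5 1.58 True
```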
The median is the value that splits the data in half when ordered in ascending order.
\[0, 1, \textbf{2}, 3, 4\]
If there are an even number of observations, then the median is the average of the two values in the middle.
\[0, 1, \underline{2, 3}, 4, 5 \rightarrow \frac{2 + 3}{2} = \textbf{2.5}\]
Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the \(50^{th}\) percentile.
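The two cases (odd and even number of observations) can be written out directly. This is a minimal sketch; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([0, 1, 2, 3, 4]))     # 2
print(median([0, 1, 2, 3, 4, 5]))  # 2.5
```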
The \(25^{th}\) percentile is also called the first quartile, Q1.
The \(50^{th}\) percentile is also called the median.
The \(75^{th}\) percentile is also called the third quartile, Q3.
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.
\[ IQR = Q3 - Q1 \]
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
\[ \begin{aligned} \text{max upper whisker reach} &= Q3 + 1.5 \times IQR \\ \text{max lower whisker reach} &= Q1 - 1.5 \times IQR \end{aligned} \]
\[ \begin{aligned} \text{IQR} &= 20 - 10 = 10 \\ \text{max upper whisker reach} &= 20 + 1.5 \times 10 = 35 \\ \text{max lower whisker reach} &= 10 - 1.5 \times 10 = -5 \end{aligned} \]
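The whisker-reach arithmetic above, with \(Q1 = 10\) and \(Q3 = 20\), can be checked with a short sketch. The function name and the list of observations are hypothetical, included only to show how points beyond the reach would be flagged as potential outliers.

```python
def whisker_reach(q1, q3):
    """Return the (lower, upper) limits that box plot whiskers may extend to."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = whisker_reach(10, 20)
print(lower, upper)  # -5.0 35.0

# Hypothetical observations: anything beyond the reach is a potential outlier.
data = [-8, 3, 12, 18, 27, 40]
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [-8, 40]
```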
Question: Why is it important to look for outliers?
Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.
Question: How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?
| scenario | median (robust) | IQR (robust) | \(\bar{x}\) (not robust) | \(s\) (not robust) |
|---|---|---|---|---|
| original data | 190K | 200K | 245K | 226K |
| move largest to $10 million | 190K | 200K | 309K | 853K |
| move smallest to $10 million | 200K | 200K | 316K | 854K |
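The same pattern can be reproduced on a small made-up data set. The incomes below are hypothetical (in thousands of dollars) and are not the data behind the table; the helper name `summarize` is also chosen for illustration. Quartile conventions vary between software packages, so this sketch simply uses the default of Python's `statistics.quantiles`.

```python
import statistics

def summarize(values):
    """Return the four summary statistics compared in the table."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "median": statistics.median(values),
        "IQR": q3 - q1,
        "mean": statistics.mean(values),
        "SD": statistics.stdev(values),
    }

# Hypothetical household incomes in thousands of dollars.
incomes = [40, 60, 80, 100, 120, 150, 200, 300, 500]

# Replace the largest value with $10 million (10,000 thousand).
altered = sorted(incomes)[:-1] + [10_000]

before = summarize(incomes)
after = summarize(altered)
print(before)
print(after)
```

In this example the median and IQR are unchanged, while the mean and SD increase sharply.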
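Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, for skewed distributions it is often more helpful to describe center and spread with the median and IQR, and for symmetric distributions with the mean and SD.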
Question: If you would like to estimate the typical household income for a student, would you be more interested in the mean or median income?
Answer: Median
If the distribution is skewed or has extreme outliers, center is often defined as the median.
Which is most likely true for the distribution of percentage of time actually spent taking notes in class versus on Facebook, Twitter, etc.?
Median: 80%
Mean: 76%
Answer: mean < median
When data are extremely skewed, transforming them might make modeling easier. A common transformation is the log transformation.
The histogram on the left shows the distribution of number of basketball games attended by students. The histogram on the right shows the distribution of log of number of games attended.
Skewed data are easier to model with when they are transformed because outliers tend to become far less prominent after an appropriate transformation.
| # of games | 70 | 50 | 25 | ⋯ |
|---|---|---|---|---|
| log(# of games) | 4.25 | 3.91 | 3.22 | ⋯ |
However, results of an analysis in log units of the measured variable might be difficult to interpret.
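The values in the table are natural logs, which a short sketch can confirm. Exponentiating is one way to bring a result in log units back to the original scale; the variable names are chosen here for illustration.

```python
import math

games = [70, 50, 25]

# Natural log of each count, rounded to two decimals as in the table.
log_games = [round(math.log(g), 2) for g in games]
print(log_games)  # [4.25, 3.91, 3.22]

# exp() undoes the transformation, returning values to the number-of-games scale.
back = [round(math.exp(math.log(g))) for g in games]
print(back)  # [70, 50, 25]
```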
Q: What other variables would you expect to be extremely skewed?
A: Salary, housing prices, etc.
Q: What patterns are apparent in the change in population between 2000 and 2010?
A table that summarizes data for two categorical variables is called a contingency table.
The contingency table below shows the distribution of survival and ages of passengers on the Titanic.
| Age | Died | Survived | Total |
|---|---|---|---|
| Adult | 1438 | 654 | 2092 |
| Child | 52 | 57 | 109 |
| Total | 1490 | 711 | 2201 |
A bar plot is a common way to display a single categorical variable.
A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.
Discussion Question:
How are bar plots different than histograms?
- Bar plots are used for displaying distributions of categorical variables; histograms are used for numerical variables.
- The x-axis in a histogram is a number line, hence the order of the bars cannot be changed.
- In a bar plot, the categories can be listed in any order (though some orderings make more sense than others, especially for ordinal variables).
Discussion Question:
Does there appear to be a relationship between age and survival for passengers on the Titanic?
| Age | Died | Survived | Total |
|---|---|---|---|
| Adult | 1438 | 654 | 2092 |
| Child | 52 | 57 | 109 |
| Total | 1490 | 711 | 2201 |
To answer this question we examine the row proportions:
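Row proportions divide each count by its row total. A minimal sketch using the counts from the contingency table (the dictionary layout and variable names are just one convenient choice):

```python
# Counts from the Titanic contingency table, one inner dict per row.
table = {
    "Adult": {"Died": 1438, "Survived": 654},
    "Child": {"Died": 52, "Survived": 57},
}

row_props = {}
for age, counts in table.items():
    row_total = sum(counts.values())
    # Each cell divided by its row total, rounded to three decimals.
    row_props[age] = {k: round(v / row_total, 3) for k, v in counts.items()}

print(row_props["Adult"])  # {'Died': 0.687, 'Survived': 0.313}
print(row_props["Child"])  # {'Died': 0.477, 'Survived': 0.523}
```

About 31% of adults survived compared with about 52% of children, so survival does appear to be associated with age.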
Stacked bar plot: Graphical display of contingency table information, for counts.
Side-by-side bar plot: Displays the same information by placing bars next to, instead of on top of, each other.
Standardized stacked bar plot: Graphical display of contingency table information, for proportions.
Discussion Question:
What are the differences between the three visualizations shown below?
Discussion Question:
What is the difference between the two visualizations shown below?
Discussion Question:
Can you tell which order encompasses the lowest percentage of mammal species?
Data from http://www.bucknell.edu/msw3.
Discussion Question:
Does there appear to be a relationship between class year and number of clubs students are in?
In 1972, as a part of a study on gender discrimination, 48 male bank supervisors were each given the same personnel file and asked to judge whether the person should be promoted to a branch manager job that was described as “routine”.
The files were identical except that half of the supervisors had files showing the person was male while the other half had files showing the person was female.
It was randomly determined which supervisors got “male” applications and which got “female” applications.
Of the 48 files reviewed, 35 were promoted.
The study is testing whether females are unfairly discriminated against.
Discussion Question:
Is this an observational study or an experiment?
Answer: Experiment
B. Rosen and T. Jerdee (1974), “Influence of sex role stereotypes on personnel decisions”, J. Applied Psychology, 59:9–14.
Discussion Question:
At a first glance, does there appear to be a relationship between promotion and gender?
| Gender | Promoted | Not Promoted | Total |
|---|---|---|---|
| Male | 21 | 3 | 24 |
| Female | 14 | 10 | 24 |
| Total | 35 | 13 | 48 |
Proportion of males promoted: 21 / 24 = 0.875 (87.5%)
Proportion of females promoted: 14 / 24 = 0.583 (58.3%)
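The difference between the two proportions follows directly from the table. The notation \(\hat{p}\) for a sample proportion is introduced here for convenience and does not appear elsewhere in these notes.

\[ \hat{p}_{\text{male}} - \hat{p}_{\text{female}} = \frac{21}{24} - \frac{14}{24} = \frac{7}{24} \approx 0.292 \]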
Practice Question:
We saw a difference of almost 30 percentage points (29.2 to be exact) between the proportions of male and female files that were promoted. Based on this information, which of the statements below is true?
If we were to repeat the experiment we will definitely see that more female files get promoted. This was a fluke.
Promotion is dependent on gender, males are more likely to be promoted, and hence there is gender discrimination against women in promotion decisions.
Hypothesis testing is very much like a court trial.
H₀: Defendant is innocent
Hₐ: Defendant is guilty
We then present the evidence — collect data.
Image from http://www.nwherald.com/_internal/cimg!0/oo1il4sf8zzaqbboq25oevvbg99wpot_
In a trial, the burden of proof is on the prosecution.
In a hypothesis test, the burden of proof is on the unusual claim.
The null hypothesis is the ordinary state of affairs (the status quo), so it’s the alternative hypothesis that we consider unusual and for which we must gather evidence.
We start with a null hypothesis (H₀) that represents the status quo.
We also have an alternative hypothesis (Hₐ) that represents our research question, i.e., what we’re testing for.
We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation (today) or theoretical methods (later in the course).
If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis. If they do, then we reject the null hypothesis in favor of the alternative.
# Testing via simulation
… under the assumption of independence, i.e., leave things up to chance.
If results from the simulations based on the chance model look like the data, then we can determine that the difference between the proportions of promoted files between males and females was simply due to chance (promotion and gender are independent).
If the results from the simulations based on the chance model do not look like the data, then we can determine that the difference between the proportions of promoted files between males and females was not due to chance, but due to an actual effect of gender (promotion and gender are dependent).
Activity:
Use a deck of playing cards to simulate this experiment.
1. Let a face card represent not promoted and a non-face card represent promoted. Consider aces as face cards. From a standard 52-card deck, remove three aces and one number card so that 13 face cards and 35 number cards remain, matching the 13 not-promoted and 35 promoted files.
2. Shuffle the cards and deal them into two groups of size 24, representing males and females.
3. Count and record how many files in each group are promoted (number cards).
4. Calculate the proportion of promoted files in each group and take the difference (male − female), and record this value.
5. Repeat steps 2–4 many times.
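The card-shuffling steps above can be sketched in software. This is a minimal illustration, not the course's own code: the function name, the seed, and the default of 100 repetitions are assumptions. The deck holds 35 "promoted" and 13 "not promoted" cards, matching the totals in the study.

```python
import random

def simulate_differences(n_sims=100, seed=0):
    """Shuffle the 48 files, deal 24 to each group, and record male - female differences."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    deck = ["promoted"] * 35 + ["not promoted"] * 13
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(deck)
        males, females = deck[:24], deck[24:]
        p_male = males.count("promoted") / 24
        p_female = females.count("promoted") / 24
        diffs.append(p_male - p_female)
    return diffs

observed = 21 / 24 - 14 / 24  # difference seen in the actual study
diffs = simulate_differences()

# Share of simulated differences at least as large as the observed one.
share_as_extreme = sum(d >= observed for d in diffs) / len(diffs)
print(share_as_extreme)
```

If only a small share of shuffles produce a difference as large as the observed one, the chance model does not look like the data.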
# Checking for independence
Practice Question:
Do the results of the simulation you just ran provide convincing evidence of gender discrimination against women, i.e., dependence between gender and promotion decisions?
Answer: Yes, the data provide convincing evidence for the alternative hypothesis of gender discrimination against women in promotion decisions. The observed difference between the two proportions was due to a real effect of gender.
These simulations are tedious and slow to run using the method described earlier. In reality, we use software to generate the simulations. The dot plot below shows the distribution of simulated differences in promotion rates based on 100 simulations.